Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(arrow-ipc) Add: Support FileReader and StreamReader skip array data validation #6938

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

totoroyyb
Copy link

@totoroyyb totoroyyb commented Jan 4, 2025

Which issue does this PR close?

Rationale for this change

Beforehand, array data validation is performed for each array creation when reading an IPC file (or stream), which comes with a significant overhead. In some cases, this overhead is unwanted or the file content is trusted.

There are existing functions defined in the codebase to avoid data validation but this is not exposed to the upper level APIs. This PR brings options for both FileReader and StreamReader to disable it.

What changes are included in this PR?

  • Add skip_validations, as an argument, to multiple functions signatures.
  • Provide a new constructor try_new_unvalidated to FileReader
  • Provide a new function with_skip_validations to StreamReader
  • Various related minor changes.

Are there any user-facing changes?

No, I don't think so. There are no API-breaking changes, and essentially, two new APIs are introduced. Other changes are on the internal codes I believe.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 4, 2025
@totoroyyb
Copy link
Author

totoroyyb commented Jan 4, 2025

FYI, clippy does complain about too many arguments on one or two function signatures atm. I just want to give this PR a go for now.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately as implemented this is unsound, in that it permits UB via safe functions, this probably needs a bit more thought

@@ -79,6 +79,7 @@ fn create_array(
field: &Field,
variadic_counts: &mut VecDeque<i64>,
require_alignment: bool,
skip_validations: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically all of these functions are now unsound and must be marked unsafe. This is not ideal, and probably needs a bit more thought into handling this better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility might be to create a new function like:

fn unsafe create_array_unchecked(
    reader: &mut ArrayReader,
    field: &Field,
    variadic_counts: &mut VecDeque<i64>,
    require_alignment: bool,
    skip_validations: bool,
) -> Result<ArrayRef, ArrowError> {

And change the existing function to call it:

fn create_array(
    reader: &mut ArrayReader,
    field: &Field,
    variadic_counts: &mut VecDeque<i64>,
    require_alignment: bool,
) -> Result<ArrayRef, ArrowError> {
  
    let skip_validations = false;
    // safety: enable validatiions when checking
    unsafe { create_array_unchecked(reader, field, variadic_counts, require_aligment, skip_validations) };
}

This is kind of messy though -- finding some way to encapsulate the settings into a struct would be better

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls check the latest commit, which separating unsafe codes.

Each function like create_array, now has a unsafe version that skips validation. Codebase is not idea, but currently I don't have much better idea to encapusulate it.

arrow-ipc/src/reader.rs Show resolved Hide resolved
@totoroyyb
Copy link
Author

totoroyyb commented Jan 5, 2025

Refactor all unsafe codes into separate functions. It is a bit messy as mentioned @alamb mentioned. From bottom-to-up, there are many code movings and renames.

Now, FileReader has a new wrapper struct:

/// An iterator over the record batches (without validation) in an Arrow file
pub struct UnvalidatedFileReader<R: Read + Seek> {
    reader: FileReader<R>,
}

impl<R: Read + Seek> Iterator for UnvalidatedFileReader<R> {
    type Item = Result<RecordBatch, ArrowError>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.reader.current_block < self.reader.total_blocks {
            // Use the unsafe `maybe_next_unvalidated` function
            unsafe {
                match self.reader.maybe_next_unvalidated() {
                    Ok(Some(batch)) => Some(Ok(batch)),
                    Ok(None) => None, // End of the file
                    Err(e) => Some(Err(e)),
                }
            }
        } else {
            None
        }
    }
}

This still unsound iteractor implementation, but I am not sure how to express it so that we can use iterator with validation disabled.

Any thoughts or suggestions?

@alamb
Copy link
Contributor

alamb commented Jan 10, 2025

Thanks @totoroyyb - I will try and review this PR in the next day or two.

@alamb
Copy link
Contributor

alamb commented Jan 10, 2025

I am starting to review this PR now -- thank you for your patience

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you @totoroyyb -- I think this is a very nice feature and it would be great to get in. The current code is quite nicely docmented

At a minimum I think the PR needs

  1. Tests: There should be tests that demonstrate that reading / writing data while skipping validation still works functionally. This includes at a minimum testing any new pub apis
  2. Benchmarks: given the goal of this PR is to improve performance I think we need a benchmark to show it is actually improving the speed. Unfortunately, I wasn't able to find any existing arrow-ipc benchmarks, but can help draft one shortly

I would like to suggest we break it down a little differently to minimize API changes

  1. Avoid adding ArrayDataBuilder::build_aligned_unchecked and instead simply add a flag on the builder. I will try and add a PR on this shortly
  2. Consider adding flags to StreamReader and FilerReader rather than new structs like UnvalidatedStreamReader

arrow-ipc/src/reader.rs Outdated Show resolved Hide resolved
/// This is useful when the file is known to be valid and the user wants to skip validations.
/// This might be useful when the content is trusted and the user wants to avoid the overhead of
/// validating the content.
pub fn try_new_unvalidated(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it doesn't set the unvalidated flag. It seems like maybe we can direct the users to the `FileReaderBuilder API if they want to avoid validation

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was left here due to the changes I made in the last commit.

@@ -1148,6 +1528,30 @@ impl<R: Read + Seek> RecordBatchReader for FileReader<R> {
}
}

/// An iterator over the record batches (without validation) in an Arrow file
pub struct UnvalidatedFileReader<R: Read + Seek> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the need for a whole different struct -- wouldn't a flag on FileReader (to mirror the flag on StreamReader) be more consistent?

@alamb
Copy link
Contributor

alamb commented Jan 10, 2025

TLDR is I think this is an important PR / feature @totoroyyb and I would like to help make it happen. I will make some initial PRs that hopefully move us forwards

Would you be willing to help write tests and benchmarks?

@totoroyyb
Copy link
Author

@alamb Thanks for reviewing this. Yes, I'd be happy to help with benchmarks and test cases.

Co-authored-by: Andrew Lamb <[email protected]>
@totoroyyb
Copy link
Author

totoroyyb commented Jan 11, 2025

@alamb Regarding adding a flag in the builder, which I did that way in the first commit. tustvold has suggested that we may need to mark unsafe for all functions that skip validation (here).

Seems like with a flag, there is no good way to mark it as unsafe for related functions. Let me know if I misunderstand something. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
3 participants